Abstract

Since some of the methods for identifying signaling genes in microarray experiments
are hierarchical networks of often simple methods, it seems natural to use simulation
to understand how well these methods perform under idealized circumstances. In this
study we assume that expression levels for signaling genes are distributed $N(\mu, 1)$, $\mu > 0$,
and are distributed $N(0, 1)$ for the non-signaling ones. Signaling genes are identified
simply by taking the top $N$ ranked t scores. Under this set-up we evaluate the probability
that the top 10 scores will correctly identify at least $M$ good genes as a function of the
gene signaling level $\mu$, the number of samples from the control and treatment populations
and the number of genes that carry the signal.

In spite of the simplicity of the model we think some insight is gained about the
relationships between the sample size and the signaling level at which some specified
performance is obtained. The conclusion, under the assumption of equal signal strengths,
is that there is considerable payoff for “genefinding” in the first few doublings of sample
size, say from 2 to 4 and perhaps to 8. The reduction of signal level required to give
specified “genefinding” performance continues with further doublings and appears to agree
with the anticipated asymptotic reduction by $\sqrt{2}$ for each doubling.

The  purpose  of  computing  is  insight.  R.W.  Hamming 


1  Introduction 

Our simplistic view of the microarray-based genefinding process is the following. Genetic
material from control and treatment experiments is applied to microarrays and the expression
levels are read (we do not need to discuss the several technologies for doing this,
although this simulation was motivated by thinking about the Affymetrix technology). Based
on statistical tests comparing treatment and control data, we wish to identify candidate
genes that signal the treatment. Of course some candidates may actually be biologically
unrelated to the treatment due to randomness in the statistical decision process (false
positives). Hence we are motivated to measure the performance of procedures for forming lists
of candidate genes.

Since we do not know the real underlying distributions of expression level, we make the
simplifying assumption that they are normal, that the variations are independent from gene
to gene, and that different samples are independent. We leave the interpretation of “sample”
to the reader, since it could be considered a replication for a single subject or possibly a
single experiment for one of several subjects. Only the mean levels of the signaling genes
change during treatment. The following questions most certainly arise in any experimental
program whose goal is to discover genes that signal a treatment: how many genes signal
the treatment and how strong is the signaling? Without any biological experience we must
be prepared to think that, depending on the treatment conditions, the number of signaling
genes can be many or few and the signaling strengths can be strong or weak, or a mixture
thereof.

To illustrate this point, Figure 1 presents simulated $|t|$ scores for (treatment - control)
expression level differences of 100 genes in which there are 10 signaling genes randomly
placed within the 100 and having $\mu$'s as shown in Figure 2. We used the absolute value $|t|$
here for ease in plotting. For Figure 1 there are 2 samples (replications) per gene for both
control and treatment. Gene expression levels under the control condition are $N(0, 1)$, as
they are also for non-signaling genes under the treatment condition. Expression levels are
$N(\mu, 1)$ for the signaling genes under the treatment condition. Only 2 of the 10 signaling
genes appear in the top 10 values of $|t|$.

[Plot: |t| scores for NSAMP=2]

Figure 1: Simulated $|t|$ scores for 2 paired samples of control and treatment
where control gene levels are $N(0, 1)$ and signaling treatment genes
are $N(\mu, 1)$. The 10 signaling $\mu$ values are shown in Figure 2. Signaling
genes shown in red. Only 2 signaling genes appear in the top 10 values of
$|t|$.

[Plot: values for signalling genes]

Figure 2: The 10 signaling $\mu$ values for the simulation experiment of Figure 1.

But as shown in Figure 3, by taking 8 paired samples of control and treatment and
computing $|t|$ scores, more of the signaling genes are perceptible; now 6 signaling genes
appear in the top 10.

This demonstrates the main idea. In the general case, since we do not know a priori
where we are operating in $\mu$-space, a reasonable strategy is to begin with a small number of
samples and see if any (or how many) very significant t-scores are found. If there are many,
and if enough of these make biological sense, then perhaps we are done. But if only a few
are significant, taking adjustments for multiple hypotheses into account, then more samples
would be called for. Taking more samples lets us perceive smaller values of $\mu$ in the noise.
A major question is: how much further down in $\mu$ can we see by an increase in sample size?

In this first simulation, which is motivated by Figures 1-3, we try to get an appreciation
of the relationship between the correct identification of signaling genes and (1) the number
of signaling genes, (2) the strength of the signaling genes and (3) the number of samples in
the control and treatment groups. In all cases we assume the $\mu$'s for the signaling genes are
all the same, which is yet a simpler case than that of Figure 1. We declare the signaling
genes to be those having the highest 10 t scores. Choosing the top 10 is meant to represent
the case in which only a few genes may be expected to signal the treatment condition. Other
simulations are in progress for which many more genes are thought to signal the treatment
condition. Also, in the remainder we use the highest t values rather than $|t|$, as in the
previous paragraphs, because we have set $\mu > 0$ for the signaling genes.

2  The  simulation 

We provide estimates of two “genefinding” performance measures for the simple method of
choosing $N_{choose}$ genes from a large total $N_g$ by taking those genes giving the top $N_{choose}$
values of t scores computed (for each gene) from microarray expression level data. In this
case it is assumed the control and treatment samples are paired so the t scores are based
on the sample mean and variance of the differences between control and treatment. This
assumption permits each sample to have a possibly random shift in mean that is common
to the control and treatment.

[Plot: |t| scores for NSAMP=8]

Figure 3: Simulated $|t|$ scores for 8 paired samples of control and treatment
where control gene levels are $N(0, 1)$ and signaling treatment genes
are $N(\mu, 1)$. The 10 signaling $\mu$ values are shown in Figure 2. Signaling
genes shown in red. Now 6 signaling genes appear in the top 10 values of $|t|$.

The  performance  measures  are: 

1.  Expected  number  of  signaling  genes  found. 

2.  Probability  that  at  least  K  signaling  genes  will  be  found. 

Both of these quantities can be estimated from the empirical distribution of $N_c$, a random
variable describing the number of correctly identified signaling genes found in an experiment.

The gene signaling model. All gene expression levels for the control group are made
i.i.d. normal with mean 0 and variance 1. We assume that only $N_{good}$ genes carry the signal
for the treatment and they are randomly chosen from all the genes. All gene expression
levels for the treatment group are also independent and normal with variance 1, and all
except the $N_{good}$ signaling genes have mean 0 also. All the $N_{good}$ signaling genes have mean
$\mu > 0$, a parameter that may be interpreted as signal level.
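The model above can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch and not the code used for the study; the function name `simulate_expression` and its argument names are assumptions introduced here.

```python
import numpy as np

def simulate_expression(n_genes, n_good, n_samp, mu, rng):
    """Draw control and treatment expression levels under the stated model.

    Hypothetical helper: control levels are i.i.d. N(0, 1); treatment levels
    are N(0, 1) except for n_good randomly placed signaling genes, which are
    N(mu, 1).
    """
    # Signaling genes are chosen at random, without replacement, from all genes.
    good = rng.choice(n_genes, size=n_good, replace=False)
    control = rng.standard_normal((n_genes, n_samp))
    treatment = rng.standard_normal((n_genes, n_samp))
    # Only the mean of the signaling genes changes under treatment.
    treatment[good] += mu
    return control, treatment, good

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    control, treatment, good = simulate_expression(100, 10, 2, 1.0, rng)
    print(control.shape, treatment.shape, np.sort(good))
```

The usage lines mirror the setting of Figure 1 (100 genes, 10 signaling genes, 2 samples), except that a single common $\mu$ is used rather than the varying values of Figure 2.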

By way of criticism, the normal constant-variance model is much too simplistic, although
the results here would be unchanged if the genes had different variances provided the
variances of control and treatment for each fixed gene were identical. That the signaling genes
all have the same signal level $\mu > 0$ is also much too simplistic. It may be more realistic
for the $\mu$'s for the signaling genes to be governed by a probability distribution, or even
deterministically controlled as in Figure 2. The use of the t statistic, which depends on normality,
can also be replaced with a non-parametric rank test at some cost in efficiency. But the
main objective is to illustrate how well this process works in an idealized case, and to see
the change in performance as a function of parameter values. So for now, these criticisms
just motivate future improvements.

Gene  identification.  For  each  gene,  the  t  statistic  will  be  based  on  the  sample  mean  of 
differences  between  control  and  treatment,  and  on  the  corresponding  sample  variance  of  the 
differences.  This  is  a  result  of  the  assumption  that  the  control-treatment  samples  are  paired. 
True  variances  can  also  be  dependent  on  the  sample  if  we  assume  the  shift  in  mean  scales 
with  the  sigma.  Since  taking  logs  of  the  data  removes  scale-only  effects  of  this  sort,  one  can 
interpret  the  simulated  variates  to  be  logged  data  of  this  type. 
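The paired t statistic and the rank-based selection described above can be sketched as follows. This is a minimal illustration under the paper's assumptions; the helper names `paired_t_scores` and `top_genes` are hypothetical and the arrays are assumed to be laid out as genes by samples.

```python
import numpy as np

def paired_t_scores(control, treatment):
    """Per-gene paired t statistic from (treatment - control) differences.

    Both arrays have shape (n_genes, n_samp); column i of each array is
    assumed to be the same paired sample.
    """
    d = treatment - control
    n_samp = d.shape[1]
    mean_d = d.mean(axis=1)
    # Sample standard deviation of the differences (divisor n_samp - 1).
    sd_d = d.std(axis=1, ddof=1)
    return mean_d / (sd_d / np.sqrt(n_samp))

def top_genes(t_scores, n_choose):
    """Indices of the n_choose largest t scores, largest first."""
    return np.argsort(t_scores)[::-1][:n_choose]

if __name__ == "__main__":
    control = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]])
    treatment = np.array([[1.0, 2.0, 3.0], [1.0, 2.0, 4.0]])
    t = paired_t_scores(control, treatment)
    print(t, top_genes(t, 1))
```

Because the statistic uses only within-pair differences, any shift common to the control and treatment members of a pair cancels, which is the property the pairing assumption is meant to provide.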

Parameter  definitions  and  values.  Some  of  these  were  already  defined  but  we  list  all 
here. 


Symbol          Values          Description

$N_g$           12,000          Number of genes
$N_{good}$      5,10,20,40      Number of signaling genes
$N_{choose}$    10              Number of genes chosen by rank method
$N_{samp}$      2,4,8,16,32     Sample size of both control and treatment groups
$N_{trials}$    1000            Number of trials in the Monte Carlo simulation
$\mu$           as needed       Signal level of the signaling genes
$N_c$                           Number of correctly identified genes

Quantities computed. For each setting of the parameters $N_{good}$, $N_{samp}$ and $\mu$, we
compute the empirical distribution of

$N_c$ = No. correctly identified genes.

Denote

$$\hat{p}_j = \frac{\#[N_c = j]}{N_{trials}}, \qquad j = 0, 1, \ldots, N_{choose}.$$

From these empirical distributions we plot the quantities

1. the mean of the empirical distribution, $\hat{m}_c = \sum_j j \hat{p}_j$,

2. $P[N_c \ge k] = \sum_{j \ge k} \hat{p}_j$, for $k = 1, 5, 9$,

as a function of $\log_2(\mu)$ for each of the conditions $N_{good} = 5, 10, 20, 40$. This gives a total
of 16 plots. We regret the large number of plots, but include them because they may
permit further analysis as in the next paragraph.
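As a rough sketch of how these quantities could be computed, the following Python code runs a small Monte Carlo experiment and summarizes the empirical distribution of $N_c$. It is not the original simulation code; the names `simulate_nc` and `summarize` are assumptions, and the usage lines take far fewer genes and trials than the 12,000 and 1000 of the study so that the example runs quickly.

```python
import numpy as np

def simulate_nc(n_genes, n_good, n_choose, n_samp, mu, n_trials, rng):
    """Return N_c for each trial: signaling genes among the top n_choose t scores."""
    nc = np.empty(n_trials, dtype=int)
    for trial in range(n_trials):
        good = rng.choice(n_genes, size=n_good, replace=False)
        # Paired differences (treatment - control); both groups are N(0, 1)
        # except that signaling genes have mean mu under treatment.
        diff = (rng.standard_normal((n_genes, n_samp))
                - rng.standard_normal((n_genes, n_samp)))
        diff[good] += mu
        t = diff.mean(axis=1) / (diff.std(axis=1, ddof=1) / np.sqrt(n_samp))
        # Indices of the n_choose largest t scores (order among them is irrelevant).
        chosen = np.argpartition(t, -n_choose)[-n_choose:]
        nc[trial] = np.isin(chosen, good).sum()
    return nc

def summarize(nc, n_choose, ks=(1, 5, 9)):
    """Empirical distribution p_hat, its mean m_hat, and P[N_c >= k] for each k."""
    p_hat = np.bincount(nc, minlength=n_choose + 1) / len(nc)
    m_hat = float(np.arange(n_choose + 1) @ p_hat)
    tails = {k: float(p_hat[k:].sum()) for k in ks}
    return p_hat, m_hat, tails

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    nc = simulate_nc(n_genes=2000, n_good=20, n_choose=10,
                     n_samp=8, mu=2.0, n_trials=200, rng=rng)
    p_hat, m_hat, tails = summarize(nc, n_choose=10)
    print(m_hat, tails)
```

Repeating such a run over a grid of $\mu$ values for each $N_{samp}$ would give one curve per sample size of the kind plotted against $\log_2(\mu)$.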




Discussion of the plots and formation of Table 1. The plots are presented in Figures
5 through 12. First note that the bottom plot of Figure 6 ($P[N_c \ge 9] = .5$) is completely
void because the event $N_c \ge 9$ is impossible if $N_{good} = 5$. From these figures we can
estimate the effect of sample size for a fixed number of good genes ($N_{good}$), or we can see
the performance as a function of $N_{good}$ for fixed sample size. Of course in real experiments
we (the experimenters) have control of sample size but the values of $N_{good}$ are unknown to
us. Simulation may help us parametrically study the effect of $N_{good}$ on the outcomes. To
illustrate the use of these plots, we will investigate the effect of sample size.

For this simulation we have chosen the parameters $N_{good} = 5, 10, 20, 40$ and $N_{samp} =
2, 4, 8, 16, 32$ to be finite sequences that increase by a factor of two. This permits us to
estimate the change in signal level $\mu$ required, as a function of doubling sample size $N_{samp}$,
to meet some performance specification. For example, in the top of Figure 9, let us examine
the changes in $\mu$ corresponding to increasing $N_{samp}$ from 4 to 8. We denote $\mu_{m,5}$
as the empirical solution to $\hat{m}_c(\mu) = 5$. Thus $\mu_{m,5}$ moves from about $\log_2(\mu) = 2.6$ to
$\log_2(\mu) = 1$, or a factor of $2^{1.6} = 3.03$. Then to increase $N_{samp}$ from 8 to 16 decreases the
required $\log_2(\mu)$ by 1, or a factor of 2 (Table 1 gives a factor of 1.8). Note that the $\log_2(\mu)$
required for $N_{samp} = 2$ is not available on this scale. We can also determine the ratio of
$\mu$'s required to maintain the probabilities $P[N_c \ge k] = \sum_{j \ge k} \hat{p}_j = P_0$ where we use $P_0 = .5$.
Table 1 results from the application of this procedure to all of the figures.
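One way to read such a signal level off a computed curve is linear interpolation in $\log_2(\mu)$. The sketch below assumes the performance curve is non-decreasing in $\mu$ on the grid; the function name `mu_at_level` is hypothetical, and the text does not say whether the values in Table 1 were obtained by interpolation or read from the plots by eye.

```python
import numpy as np

def mu_at_level(log2_mu, perf, target):
    """Signal level mu at which a performance curve reaches `target`.

    log2_mu : grid of log2(mu) values at which the curve was simulated
    perf    : performance at each grid point (e.g. m_hat or P[N_c >= k]),
              assumed non-decreasing along the grid
    Returns NaN when the target is not reached on the grid, which plays the
    role of an asterisk in a table of ratios.
    """
    log2_mu = np.asarray(log2_mu, dtype=float)
    perf = np.asarray(perf, dtype=float)
    if target < perf.min() or target > perf.max():
        return float("nan")
    # np.interp needs increasing abscissae, so performance is used as x here.
    return float(2.0 ** np.interp(target, perf, log2_mu))

if __name__ == "__main__":
    # Reading from the text: log2(mu) falls from about 2.6 to about 1 when
    # N_samp goes from 4 to 8, i.e. a ratio of about 3.03 in mu.
    print(2.0 ** (2.6 - 1.0))
    # Made-up curve, for illustration only.
    print(mu_at_level([0.0, 1.0, 2.0], [2.0, 4.0, 8.0], target=5.0))
```

The ratio for a doubling is then the quotient of two such values, one for each sample size.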

Discussion of Table 1. Note first that the top part of Table 1 has many asterisks, each
of which indicates that the quantity could not be determined from the computed curves.
In some cases where $\mu$ values were not initially chosen well, additional simulations were
done to supply the needed parameter values. However, the many asterisks in the section
labeled $N_{good} = 5$ occur because $N_c \ge 9$ is impossible if $N_{good} = 5$ and achieving $\hat{m}_c = 5$ or
$P[N_c \ge 5] = .5$ is not to be expected for finite $\mu$.

The main observation to be made from the table is that for all of the performance
measures used, the sample size doubling of 2 to 4 has a much larger effect on the decrease
in discernible signal level than does the one from 4 to 8 and especially the later doublings,
say from 16 to 32. Here we take discernible signal level as the value of $\mu$ that gives some
specified level of performance. Note the $\mu$ ratios for the 16 to 32 doubling are all near
1.5 (they range from 1.53 to 1.67), whereas we anticipate a diminishing of $\sqrt{2}$ in the limit
as the sample size tends to infinity. This asymptotic value of $\sqrt{2}$ would occur, for example,
whenever the sampling distributions for $\hat{m}_c$ and $\hat{p}_j$ have means that are constant with
respect to sample size and are symmetrically distributed about those means. Since we can
expect the estimators $\hat{m}_c$ and $\hat{p}_j$ to be asymptotically normal, the preceding condition
would hold asymptotically. The $\sqrt{2}$ dependence comes simply from the diminishing of sample
variance due to doubling sample size.

$\Delta N_{samp}$   $\mu$ ratio for     $\mu$ ratio for        $\mu$ ratio for        $\mu$ ratio for
            $\hat{m}_c = 5$         $P[N_c \ge 1] = .5$    $P[N_c \ge 5] = .5$    $P[N_c \ge 9] = .5$

$N_{good} = 5$
2 to 4      */*                 */4.94                 */*                    */*
4 to 8      */*                 4.9/1.71=2.86          */*                    */*
8 to 16     */*                 1.71/0.97=1.76         */*                    */*
16 to 32    */*                 .97/.61=1.59           */*                    */*

$N_{good} = 10$
2 to 4      */7.38              45/3.65=12.3           */8.03                 */25.20
4 to 8      7.38/2.24=3.29      3.65/1.38=2.64         8.03/2.38=3.37         25.20/4.71=5.35
8 to 16     2.24/1.22=1.83      1.38/0.78=1.77         2.38/1.26=1.89         4.71/2.17=2.17
16 to 32    1.22/0.78=1.56      0.78/0.51=1.53         1.26/0.81=1.55         2.17/1.30=1.67

$N_{good} = 20$
2 to 4      240/5.18=46.3       39.42/2.75=14.3        200/5.58=35.8          */13.90
4 to 8      5.18/1.75=2.96      2.75/1.09=2.52         5.58/1.83=3.05         13.90/3.05=4.56
8 to 16     1.75/0.97=1.80      1.09/0.64=1.70         1.83/1.01=1.81         3.05/1.49=2.05
16 to 32    0.97/0.62=1.56      0.64/.41=1.56          1.01/0.64=1.58         1.49/0.91=1.64

$N_{good} = 40$
2 to 4      120/3.87=31.0       20.77/2.06=10.08       100/4.17=24.0          */10.04
4 to 8      3.87/1.42=2.73      2.06/0.87=2.37         4.17/1.49=2.80         10.04/2.47=4.21
8 to 16     1.42/0.79=1.80      0.87/0.51=1.71         1.49/0.83=1.80         2.47/1.23=2.01
16 to 32    0.79/0.51=1.55      0.51/.33=1.55          0.83/0.53=1.57         1.23/0.75=1.64

Table 1: Ratios of signaling level $\mu$ required to achieve $\hat{m}_c = \sum_j j \hat{p}_j = 5$,
and $P[N_c \ge k] = \sum_{j \ge k} \hat{p}_j = .5$, for $k = 1, 5, 9$. An asterisk indicates that
the quantity could not be determined from the computed curves. In some
cases where $\mu$ values were not initially chosen well, additional simulations
were done for needed parameter values. However, the many asterisks in the
section labeled $N_{good} = 5$ occur because achieving $\hat{m}_c = 5$ or $P[N_c \ge 5] = .5$
is not to be expected for finite $\mu$.

To be a little more explicit, suppose the sample size $N$ is sufficiently large so that
$\hat{m}_c(2^k N)$ is normal ($k \ge 1$) with mean $m_0$ and variance $\sigma^2/2^k$. Denote $z_p$ as the $100 \times p$th
percentile of a standard normal. Then


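$$\Pr\left[\frac{\hat{m}_c(N) - m_0}{\sigma} \ge z_p\right] = 1 - p$$

$$\Pr\left[\frac{\hat{m}_c(2^k N) - m_0}{\sigma/\sqrt{2^k}} \ge z_p\right] = 1 - p.$$

Thus the change in the $100 \times p$th percentile threshold for increasing from $N$ to $2N$ is

$$m_0 + z_p\sigma - m_0 - z_p\sigma/\sqrt{2} = z_p\sigma(1 - 1/\sqrt{2})$$

and for increasing from $2N$ to $4N$ it is

$$m_0 + z_p\sigma/\sqrt{2} - m_0 - z_p\sigma/2 = z_p\sigma(1 - 1/\sqrt{2})/\sqrt{2}.$$

This produces a ratio of changes in the $100 \times p$th percentiles due to doubling sample sizes
of $1/\sqrt{2}$.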
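The ratio can be checked numerically with a few lines of Python. The values chosen for $m_0$, $\sigma$ and $z_p$ below are arbitrary illustrations and not quantities from the simulation, and the helper name `percentile_threshold` is an assumption.

```python
import math

def percentile_threshold(m0, sigma, z_p, k):
    """100*p-th percentile threshold m0 + z_p * sigma / sqrt(2^k) at sample size 2^k N."""
    return m0 + z_p * sigma / math.sqrt(2 ** k)

# Arbitrary illustrative values (assumptions, not taken from the paper).
m0, sigma, z_p = 5.0, 1.0, 1.645

change_n_to_2n = (percentile_threshold(m0, sigma, z_p, 0)
                  - percentile_threshold(m0, sigma, z_p, 1))
change_2n_to_4n = (percentile_threshold(m0, sigma, z_p, 1)
                   - percentile_threshold(m0, sigma, z_p, 2))

# Ratio of successive changes; algebraically this equals 1/sqrt(2).
ratio = change_2n_to_4n / change_n_to_2n
print(ratio, 1.0 / math.sqrt(2.0))
```

The ratio does not depend on the particular $m_0$, $\sigma$ or $z_p$ (as long as $z_p \sigma \ne 0$), since both changes are proportional to $z_p \sigma$.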

Since the quantities of interest, $\hat{m}_c$ and $\hat{p}_j$, are both defined with respect to the random
variable $N_c$ (no. good genes identified), we show empirical distributions of $N_c$ from the
simulations in the four plots of Figure 4. These plots show that the empirical distributions
of $N_c$ are far from symmetric (and far from Gaussian) for the cases $N_{samp} = 4, 8, 16$ but look
much more Gaussian for $N_{samp} = 32$.

Figure 4: Empirical distributions of $N_c$ for $\mu = .8$, $N_{good} = 20$ and $N_{samp} =
4, 8, 16, 32$.




We remind the reader that these results are based on equal signal strengths, but under this
assumption, the simulation implies that there is considerable payoff in the reduction of
discernible signal level in going from 2 to 4 and perhaps to 8 experiments. And although it is
expensive to go from 2 to 4 or to 8, the expense is much less than from 16 to 32.


Confidence limits. The 95% confidence interval for estimating a probability $P_0 = .5$
with a sample of 1000 is [.469, .531] or roughly $.5 \pm .03$. This may be transformed back to
a statement about $\mu$ using the experimentally determined curves (the 16 plots in Figures
5 through 12).

The confidence interval for $\hat{m}_c$ is estimated using the sample variances $\hat{\sigma}_c^2 = \sum_j (j -
\hat{m}_c)^2 \hat{p}_j$, which were found to be at most 4.5 at $\mu$'s that gave $\hat{m}_c = 5$. Then using asymptotic
normality of

$$\hat{m}_c = \sum_j j \hat{p}_j = \frac{1}{N_{trials}} \sum_{n=1}^{N_{trials}} N_{c,n},$$

where the numbers of correctly found genes in each trial, $N_{c,n}$, are considered independent random
variables, the estimate of standard error for $\hat{m}_c$ is $\sqrt{4.5 \times 10^{-3}} = .067$. From the normality
assumption, the 95% confidence interval around $\hat{m}_c = 5$ is within the interval $\hat{m}_c \pm .13$.
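The arithmetic behind these limits can be reproduced as follows. This is a small sketch using the normal approximation with $z = 1.96$; the function names are hypothetical.

```python
import math

def binomial_ci_halfwidth(p, n, z=1.96):
    """Half-width of the normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(p * (1.0 - p) / n)

def mean_standard_error(variance, n):
    """Standard error of a sample mean of n independent draws with the given variance."""
    return math.sqrt(variance / n)

n_trials = 1000

# Probability estimated near P0 = .5: half-width of roughly .03.
half_width_p = binomial_ci_halfwidth(0.5, n_trials)

# Mean number of correctly identified genes: variance at most 4.5.
se_mean = mean_standard_error(4.5, n_trials)   # roughly .067
half_width_m = 1.96 * se_mean                  # roughly .13

print(round(half_width_p, 3), round(se_mean, 3), round(half_width_m, 2))
```

Since 4.5 is an upper bound on the sample variance, the half-width of about .13 is conservative for the settings considered.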


Comments, Suggestions for improvement. The purpose of this exercise was to see if
simulation could help our understanding of the interplay of the parameters $N_{samp}$, $N_{good}$
and $N_{choose}$ in the simple genefinding algorithm of choosing the top $N_{choose}$ t scores. The
author welcomes comments, suggestions and new questions that arise from this small effort.
The following items of improvement seem clearly interesting.

1. Evaluate the effect of other signal strength distributions (here it is constant, i.e.,
uniform).

2. Base the random expression levels on empirical distributions from observed data or
on distributions whose parameters are determined by observed data.

3. Use simulation to help understand how the elements of the list change as sample size
is increased. For example, given a realization from $N_{samp} = 2$, what should we expect
from our gene list, t scores, etc., when we go to $N_{samp} = 4$?

Figure 5: Results of genefinding simulation, $N_{choose} = 10$, $N_{good} = 5$, $N_{trials} = 1000$. Genes
chosen from top 10 t-scores based on paired observations.

(b) Probability of identifying at least 9 good genes.

Figure 6: Results of genefinding simulation, $N_{choose} = 10$, $N_{good} = 5$, $N_{trials} = 1000$. Genes
chosen from top 10 t-scores based on paired observations. Note $N_c \ge 9$ is impossible for
$N_{good} = 5$.

(b) Probability of identifying at least 1 good gene.

Figure 7: Results of genefinding simulation, $N_{choose} = 10$, $N_{good} = 10$, $N_{trials} = 1000$.
Genes chosen from top 10 t-scores based on paired observations.

(b) Probability of identifying at least 9 good genes.

Figure 8: Results of genefinding simulation, $N_{choose} = 10$, $N_{good} = 10$, $N_{trials} = 1000$.
Genes chosen from top 10 t-scores based on paired observations.

Figure 9: Results of genefinding simulation, $N_{choose} = 10$, $N_{good} = 20$, $N_{trials} = 1000$.
Genes chosen from top 10 t-scores based on paired observations.

[Plot: empirical prob of choosing at least 5 good genes, NUMSIM: 1000; horizontal axis: log2($\mu$) in signalling genes]

(a) Probability of identifying at least 5 good genes.

[Plot: empirical prob of choosing at least 9 good genes, NUMSIM: 1000; horizontal axis: log2($\mu$) in signalling genes]

(b) Probability of identifying at least 9 good genes.

Figure 10: Results of genefinding simulation, $N_{choose} = 10$, $N_{good} = 20$, $N_{trials} = 1000$.
Genes chosen from top 10 t-scores based on paired observations.

(b) Probability of identifying at least 1 good gene.

Figure 11: Results of genefinding simulation, $N_{choose} = 10$, $N_{good} = 40$, $N_{trials} = 1000$.
Genes chosen from top 10 t-scores based on paired observations.

Figure 12: Results of genefinding simulation, $N_{choose} = 10$, $N_{good} = 40$, $N_{trials} = 1000$.
Genes chosen from top 10 t-scores based on paired observations.


